This paper addresses online query processing for large-scale, incremental data analysis on a distributed stream processing engine (DSPE). Our goal is to convert any SQL-like query to an incremental DSPE program automatically. In contrast to other approaches, we derive incremental programs that return accurate results, not approximate answers. This is accomplished by retaining a minimal state during the query evaluation lifetime and by using incremental evaluation techniques to return an accurate snapshot answer at each time interval that depends on the current state and the latest batches of data. Our methods can handle many forms of queries on nested data collections, including iterative and nested queries, group-by with aggregation, and equi-joins. Finally, we report on a prototype implementation of our framework, called MRQL Streaming, running on top of Spark and we experimentally validate the effectiveness of our methods.
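
As a rough illustration of the idea of retaining a small state and combining it with each new batch, the following Python sketch evaluates a group-by average incrementally. It is not the paper's derivation method or the MRQL Streaming API; the names `merge_batch` and `snapshot`, the `(sum, count)` state layout, and the sample batches are all assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical names throughout; this is not MRQL Streaming's API.
# Assumed state layout: key -> (running sum, running count).
State = Dict[str, Tuple[float, int]]


def merge_batch(state: State, batch: Iterable[Tuple[str, float]]) -> State:
    """Fold one batch of (key, value) records into the retained state."""
    new_state = dict(state)
    for key, value in batch:
        total, count = new_state.get(key, (0.0, 0))
        new_state[key] = (total + value, count + 1)
    return new_state


def snapshot(state: State) -> Dict[str, float]:
    """Derive the exact per-key average from the state alone."""
    return {key: total / count for key, (total, count) in state.items()}


# Two made-up batches arriving at successive time intervals.
batches = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0)],
]

state: State = {}
answers = []
for batch in batches:
    state = merge_batch(state, batch)   # only the small state is kept
    answers.append(snapshot(state))     # exact answer for all data so far
```

Each entry of `answers` equals the average that a batch query over all records seen so far would return, yet the earlier records themselves are never stored. Keeping the pair `(sum, count)` rather than the average is what makes the merge exact, since an average alone cannot be combined with new data.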